Objective

The objective is to classify users with respect to the label attribute, based on the other attributes in the dataset.

Abstract

(Details and visualizations are embedded below)

1. Data cleansing and preprocessing

Handling missing values:

  • There were no missing values in this dataset

Analysing distributions:

  • web_site - two categories accounted for 52% of the examples yet had zero clicks; the attribute also has high cardinality

  • user_identifier - has a very long-tailed distribution

  • request_time - spans two months in 2019, with additive seasonality on weekends

  • browser, os and state - don't have high cardinality; they seem to add more information, with varied distributions

Target attribute:

  • There are ~4.8K positive labels (clicks) in this column out of ~291K examples, with a money loss of ~2K

Feature representation:

  • web_site - 2 large categories with no clicks were filtered out to help the model, and categories with fewer than 1K occurrences were merged into a single category

  • user_identifier - the entire dataset was aggregated by this attribute; it could also have been embedded together with other attributes

  • request_time - was split into multiple attributes: month, day, weekday name, hour

  • All features were transformed into their one-hot representation after the user_identifier aggregation
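The list-valued columns produced by the aggregation can be binarized per user; a toy sketch of the idea implemented by to_one_hot further down (values here are made up):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Toy example: after aggregation each user holds a list of observed values.
mlb = MultiLabelBinarizer()
onehot = mlb.fit_transform([['CHROME77', 'CHROME78'], ['CHROME77']])
print(list(mlb.classes_))  # ['CHROME77', 'CHROME78']
print(onehot.tolist())     # [[1, 1], [1, 0]]
```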

2. Modeling

Fitting a baseline model:

  • The first step is to define a baseline model. The XGBoost algorithm was chosen for this purpose: it is simple to set up, very scalable, and outputs class probabilities (as opposed to an SVM model)

Fitting models with multiple algorithms:

  • An A/B test was conducted based on two algorithms: XGBoost and Random Forest
  • A randomized search over a hyper-parameter grid was used for tuning

3. Evaluation

Assumption:

  • The business implications of such a classifier can vary, but for this exercise let's assume that the main goal is to maximize profit, with no constraint on how much money is invested as long as the result is profitable. We can therefore combine the model's predicted probability with the expected value to set a decision threshold
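Under this assumption the threshold follows directly from the payoff used later in evaluate_with_price: a served impression that gets a click nets +60, one that doesn't nets -1. A quick sketch of the break-even probability:

```python
# Expected value of serving one impression with click probability p:
#   EV(p) = 60 * p - 1 * (1 - p) = 61 * p - 1
# Serving is profitable when EV(p) >= 0, i.e. p >= 1/61.
break_even = 1 / 61
print(round(break_even, 4))  # 0.0164
```

This is where the default threshold of 0.0164 in the evaluation code comes from.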

Objective function:

  • A customized objective function could have been used, with the expected value integrated into it
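As a sketch only (this was not what was run here), a cost-weighted logistic objective for the xgboost sklearn wrapper could look as follows; the 60:1 weights mirror the assumed click payoff, and the function name is illustrative:

```python
import numpy as np

# Hypothetical cost-weighted logloss for XGBClassifier(objective=...).
# The sklearn wrapper passes (y_true, raw_margin) and expects (grad, hess).
def profit_weighted_logloss(y_true, y_pred):
    p = 1.0 / (1.0 + np.exp(-y_pred))      # sigmoid of the raw margin
    w = np.where(y_true == 1, 60.0, 1.0)   # assumed payoff weights (60:1)
    grad = w * (p - y_true)                # weighted logloss gradient
    hess = w * p * (1.0 - p)               # weighted logloss hessian
    return grad, hess
```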

Choosing the right metric:

  • The natural choice would be the accuracy metric, but it is not indicative enough on our imbalanced data. Let's break down what classification quality means here: on the one hand we want to capture as many instances of a given class as possible (recall), but we also want transactions not to be misclassified (precision). So the best evaluation metric for our problem is the F1 score, which is the harmonic mean of recall and precision.
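For concreteness, with hypothetical counts:

```python
# F1 as the harmonic mean of precision and recall (hypothetical counts).
tp, fp, fn = 80, 20, 40
precision = tp / (tp + fp)    # 0.8
recall = tp / (tp + fn)       # ~0.667
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.727
```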
In [1]:
from __future__ import print_function
import os
import pickle
import numpy as np
import pandas as pd
import re

from sklearn import preprocessing
from sklearn.metrics import recall_score, precision_score, f1_score
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from xgboost import XGBClassifier

import plotly.express as px


from datetime import datetime
%matplotlib inline



pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 999)

df = pd.read_csv('exam_exam_ds.csv')
In [2]:
print(df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290991 entries, 0 to 290990
Data columns (total 7 columns):
request_time       290991 non-null object
browser            290991 non-null object
os                 290991 non-null object
state              290991 non-null object
web_site           290991 non-null object
user_identifier    290991 non-null object
label              290991 non-null int64
dtypes: int64(1), object(6)
memory usage: 15.5+ MB
None
In [3]:
number_of_clicks = df['label'].sum()
cur_profit = number_of_clicks * 60 - df['label'].count()
print('There are {} views of ads with {} clicks and a total profit of {}'.format(df['label'].count(), number_of_clicks, cur_profit))
There are 290991 views of ads with 4811 clicks and a total profit of -2331
In [4]:
def count_val(x_axis, order_by=False):
    tmp = df.groupby(x_axis)['label'].agg(['count', 'mean']).reset_index()
    if order_by:
        fig = px.scatter(tmp, x=x_axis, y="count", size='mean').update_xaxes(categoryorder="total descending")
    else:        
        fig = px.scatter(tmp, x=x_axis, y="count", size='mean')
    fig.show()
    
def dateparser(i):
    parsed_date = re.split(r'\.|\s[A-Za-z]', i)  # split on '.' or on a space followed by a letter
    time_stamp = datetime.strptime(parsed_date[0], '%Y-%m-%d %H:%M:%S')
    pd_date = pd.to_datetime(time_stamp)
    return pd.Series({'date':pd_date.strftime('%d/%m/%Y'), 
                 'month': pd_date.month,
                 'day': pd_date.day,
                 'weekday': pd_date.day_name(),
                 'hour': pd_date.strftime('%H'),
                'time': pd_date.time()
                })

def merge_classes(column, min_records, encode=True):
    counts = column.value_counts()
    merged_column = column.copy()
    merged_column.loc[column.map(counts) < min_records] = 'other'
    print('Number of classes in {} after merge: {}'.format(column.name, len(merged_column.value_counts())))

    if encode:
        encoder = preprocessing.LabelEncoder()
        merged_column = encoder.fit_transform(merged_column)

    return merged_column
In [5]:
df_dates = df['request_time'].apply(dateparser)
df = pd.concat([df, df_dates], axis=1)
In [6]:
count_val("web_site", True)
In [7]:
df = df[~df['web_site'].isin(['hdpopcorns.co', 'cyberreel.com'])]

The threshold of 1000+ occurrences was chosen mainly for computational reasons.

In [8]:
df['web_site_cat_1000'] = merge_classes(df["web_site"], 1000, False)
Number of classes in web_site after merge: 8
In [9]:
count_val('browser', True)
In [10]:
count_val('date')
In [11]:
count_val('weekday')
In [12]:
cols = df.columns.drop(['request_time', 'label', 'time', 'user_identifier', 'date', 'web_site'])
df[cols].nunique()
Out[12]:
browser              22
os                    6
state                39
month                 2
day                  31
weekday               7
hour                 24
web_site_cat_1000     8
dtype: int64
In [13]:
df_agg = df.groupby(['user_identifier', 'label'], as_index=False)[cols].agg(lambda x: list(x))
df_agg.head()
df_agg.shape
Out[13]:
(82094, 10)
In [14]:
df_agg.sample(10)
Out[14]:
user_identifier label browser os state month day weekday hour web_site_cat_1000
16029 143.210.12 0 [CHROME77, CHROME77] [WINDOWS_10, WINDOWS_10] [MA, MA] [10, 10] [11, 11] [Friday, Friday] [18, 18] [other, other]
68984 65.237.162 0 [CHROME78] [WINDOWS_10] [UT] [11] [19] [Tuesday] [23] [other]
45144 225.71.111 0 [CHROME73, CHROME73] [CHROME_OS, CHROME_OS] [WA, WA] [11, 11] [10, 10] [Sunday, Sunday] [17, 17] [fmovies.se, fmovies.se]
56657 3.107.136 0 [CHROME78, CHROME78] [WINDOWS_10, WINDOWS_10] [UT, UT] [11, 11] [12, 12] [Tuesday, Tuesday] [16, 16] [other, other]
6106 113.177.13 0 [CHROME78] [WINDOWS_10] [TN] [11] [10] [Sunday] [00] [other]
14851 14.211.183 0 [CHROME78] [WINDOWS_10] [CO] [11] [7] [Thursday] [08] [other]
2066 104.126.18 0 [CHROME64, CHROME64] [MAC_OS_X, MAC_OS_X] [FL, FL] [11, 11] [13, 13] [Wednesday, Wednesday] [10, 23] [imdark.com, imdark.com]
11084 13.168.88 0 [CHROME77, CHROME77] [MAC_OS_X, MAC_OS_X] [PA, PA] [10, 10] [26, 26] [Saturday, Saturday] [19, 19] [other, other]
24374 167.166.12 0 [CHROME78, CHROME78] [WINDOWS_7, WINDOWS_7] [CT, CT] [11, 11] [11, 11] [Monday, Monday] [22, 22] [other, other]
65483 55.226.209 0 [CHROME77] [CHROME_OS] [TX] [11] [4] [Monday] [18] [other]
In [15]:
from sklearn.preprocessing import MultiLabelBinarizer

def to_one_hot(col, df):
    mlb = MultiLabelBinarizer()
    res = pd.DataFrame(mlb.fit_transform(df[col]),
                       columns=['{}_'.format(col) + str(x) for x in mlb.classes_],
                       index=df.user_identifier).groupby(level=0).max()
    return res
In [16]:
df_onehot = df_agg 
for i in cols:
    df_onehot = df_onehot.merge(to_one_hot(i, df_agg), on='user_identifier')
In [17]:
df_final = df_onehot.drop(cols, axis=1)
df_final.shape
Out[17]:
(82094, 141)
In [18]:
df_final.sample(10)
Out[18]:
user_identifier label browser_CHROME62 browser_CHROME63 browser_CHROME64 browser_CHROME65 browser_CHROME66 browser_CHROME67 browser_CHROME68 browser_CHROME69 browser_CHROME70 browser_CHROME71 browser_CHROME72 browser_CHROME73 browser_CHROME74 browser_CHROME75 browser_CHROME76 browser_CHROME77 browser_CHROME78 browser_FIREFOX66 browser_FIREFOX67 browser_FIREFOX68 browser_FIREFOX69 browser_FIREFOX70 os_CHROME_OS os_MAC_OS_X os_WINDOWS_10 os_WINDOWS_7 os_WINDOWS_8 os_WINDOWS_81 state_AL state_AR state_AZ state_CA state_CO state_CT state_DC state_DE state_FL state_GA state_IL state_IN state_KS state_KY state_LA state_MA state_MD state_ME state_MI state_MN state_MO state_MS state_NH state_NJ state_NM state_NY state_OH state_OR state_PA state_SC state_TN state_TX state_UNKNOWN state_UT state_VA state_VT state_WA state_WI state_WV month_10 month_11 day_1 day_2 day_3 day_4 day_5 day_6 day_7 day_8 day_9 day_10 day_11 day_12 day_13 day_14 day_15 day_16 day_17 day_18 day_19 day_20 day_21 day_22 day_23 day_24 day_25 day_26 day_27 day_28 day_29 day_30 day_31 weekday_Friday weekday_Monday weekday_Saturday weekday_Sunday weekday_Thursday weekday_Tuesday weekday_Wednesday hour_00 hour_01 hour_02 hour_03 hour_04 hour_05 hour_06 hour_07 hour_08 hour_09 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23 web_site_cat_1000_citationmachine.com web_site_cat_1000_crackstream.com web_site_cat_1000_fmovies.se web_site_cat_1000_imdark.com web_site_cat_1000_other web_site_cat_1000_scribble.io web_site_cat_1000_searchsafe.co web_site_cat_1000_speedtest.com
48426 234.93.122 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
80277 95.137.21 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0
30527 185.177.94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0
14112 138.41.182 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
80639 95.94.7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
76178 85.154.88 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
41014 215.252.51 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
57965 32.94.114 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
29494 182.157.57 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
67558 61.11.16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0

2. Modeling

In [19]:
y = df_final['label']
X = df_final.drop(['label', 'user_identifier'], axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
print("Train set data shape: {} \ny_train labeled data shape: {}".format(X_train.shape, y_train.sum()))
Train set data shape: (61570, 139) 
y_train labeled data shape: 3621
In [38]:
def evaluate_with_price(x, y_true, clf, threshold=0.0164):
    y_pred = clf.predict_proba(x)
    tmp = pd.DataFrame(np.column_stack([y_pred, y_true.to_numpy()]), columns=['pred_0', 'pred_1', 'true'])
    tmp['pred_value'] = np.where(tmp['pred_1']>threshold, 1, 0)
    tmp['expected_value'] = 0
    tmp.loc[(tmp['pred_value'] ==1) & (tmp['true']==0), 'expected_value'] = -1
    tmp.loc[(tmp['pred_value'] ==1) & (tmp['true']==1), 'expected_value'] = 60
    print('Score after transformation \nRecall: {:.4}\nPrecision: {:.4}\nF1: {:.4}'.format(recall_score(y_true, tmp['pred_value'].values),
                                                     precision_score(y_true, tmp['pred_value'].values), 
                                                     f1_score(y_true, tmp['pred_value'].values)))
    cur_profit = tmp['expected_value'].sum()
    print('There were {:,} clicks on ads with predicted {:,} clicks and a total profit of {:,}'.format(tmp['true'].sum(),
                                                                                           tmp['pred_value'].sum(),
                                                                                          cur_profit))
    return tmp

def plot_pred_hist(axis='pred_1'):
    fig = px.histogram(results, x=axis, color="true", histnorm='probability density')
    fig.show()

Model training:

  • logloss was chosen as the objective function since probabilities matter for this prediction, and logloss is more sensitive to wrong probabilities
  • the scale_pos_weight hyper-parameter is used to help with the imbalanced data set
In [21]:
eval_set = [(X_train, y_train), (X_test, y_test)]
ratio = (y_train==False).sum()/(y_train==True).sum()
early_stopping_rounds = 10

clf = XGBClassifier(
        eval_metric = "logloss",
        objective='binary:logistic',
        random_state=1,
        scale_pos_weight = ratio
        )


params = {
          'min_child_weight': [0.5, 1, 6],
          'learning_rate':[0.1, 0.3, 1, 5],
          'n_estimators':[300, 400, 600]
        }

scoring = ['roc_auc', 'f1', 'recall', 'precision']

GsCv = RandomizedSearchCV(clf,
                        params, 
                        n_iter=4,
                        cv=3,
                        scoring = scoring,
                        refit = 'f1').fit(X_train, y_train,
                            eval_set=eval_set,
                            early_stopping_rounds=early_stopping_rounds,
                            verbose=100)
[0]	validation_0-logloss:0.750336	validation_1-logloss:0.751144
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
Stopping. Best iteration:
[0]	validation_0-logloss:0.750336	validation_1-logloss:0.751144

[0]	validation_0-logloss:0.747783	validation_1-logloss:0.750263
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
Stopping. Best iteration:
[0]	validation_0-logloss:0.747783	validation_1-logloss:0.750263

[0]	validation_0-logloss:0.861543	validation_1-logloss:0.868727
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
Stopping. Best iteration:
[0]	validation_0-logloss:0.861543	validation_1-logloss:0.868727

[0]	validation_0-logloss:0.690434	validation_1-logloss:0.690409
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.643993	validation_1-logloss:0.648029
[200]	validation_0-logloss:0.63051	validation_1-logloss:0.636948
[300]	validation_0-logloss:0.620317	validation_1-logloss:0.629452
[399]	validation_0-logloss:0.612728	validation_1-logloss:0.62313
[0]	validation_0-logloss:0.690416	validation_1-logloss:0.690416
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.643461	validation_1-logloss:0.646295
[200]	validation_0-logloss:0.629942	validation_1-logloss:0.635865
[300]	validation_0-logloss:0.620562	validation_1-logloss:0.628511
[399]	validation_0-logloss:0.613196	validation_1-logloss:0.622939
[0]	validation_0-logloss:0.690112	validation_1-logloss:0.690207
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.642237	validation_1-logloss:0.644878
[200]	validation_0-logloss:0.629009	validation_1-logloss:0.63375
[300]	validation_0-logloss:0.619873	validation_1-logloss:0.626341
[399]	validation_0-logloss:0.612569	validation_1-logloss:0.620724
[0]	validation_0-logloss:0.686026	validation_1-logloss:0.685963
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.619371	validation_1-logloss:0.628475
[200]	validation_0-logloss:0.598136	validation_1-logloss:0.61345
[300]	validation_0-logloss:0.580929	validation_1-logloss:0.600876
[399]	validation_0-logloss:0.569493	validation_1-logloss:0.592748
[0]	validation_0-logloss:0.685947	validation_1-logloss:0.685827
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.619368	validation_1-logloss:0.627065
[200]	validation_0-logloss:0.600439	validation_1-logloss:0.612083
[300]	validation_0-logloss:0.584065	validation_1-logloss:0.599822
[399]	validation_0-logloss:0.571413	validation_1-logloss:0.590988
[0]	validation_0-logloss:0.685391	validation_1-logloss:0.685766
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.617945	validation_1-logloss:0.624732
[200]	validation_0-logloss:0.597152	validation_1-logloss:0.608577
[300]	validation_0-logloss:0.581004	validation_1-logloss:0.5961
[399]	validation_0-logloss:0.568293	validation_1-logloss:0.587424
[0]	validation_0-logloss:0.690434	validation_1-logloss:0.690409
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.644893	validation_1-logloss:0.648452
[200]	validation_0-logloss:0.630848	validation_1-logloss:0.636892
[300]	validation_0-logloss:0.620749	validation_1-logloss:0.629228
[400]	validation_0-logloss:0.61251	validation_1-logloss:0.622916
[500]	validation_0-logloss:0.605613	validation_1-logloss:0.617547
[599]	validation_0-logloss:0.599315	validation_1-logloss:0.612859
[0]	validation_0-logloss:0.690407	validation_1-logloss:0.690358
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.643468	validation_1-logloss:0.646122
[200]	validation_0-logloss:0.630043	validation_1-logloss:0.635474
[300]	validation_0-logloss:0.620916	validation_1-logloss:0.628406
[400]	validation_0-logloss:0.613097	validation_1-logloss:0.622152
[500]	validation_0-logloss:0.605919	validation_1-logloss:0.616438
[599]	validation_0-logloss:0.599893	validation_1-logloss:0.611833
[0]	validation_0-logloss:0.690129	validation_1-logloss:0.690248
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.642954	validation_1-logloss:0.645506
[200]	validation_0-logloss:0.629865	validation_1-logloss:0.634249
[300]	validation_0-logloss:0.620374	validation_1-logloss:0.626458
[400]	validation_0-logloss:0.612403	validation_1-logloss:0.619952
[500]	validation_0-logloss:0.605925	validation_1-logloss:0.615136
[599]	validation_0-logloss:0.59993	validation_1-logloss:0.610703
[0]	validation_0-logloss:0.690349	validation_1-logloss:0.690455
Multiple eval metrics have been passed: 'validation_1-logloss' will be used for early stopping.

Will train until validation_1-logloss hasn't improved in 10 rounds.
[100]	validation_0-logloss:0.646619	validation_1-logloss:0.64942
[200]	validation_0-logloss:0.635208	validation_1-logloss:0.640505
[300]	validation_0-logloss:0.627488	validation_1-logloss:0.635067
[399]	validation_0-logloss:0.621347	validation_1-logloss:0.630525
In [22]:
print(GsCv.best_params_)
pd.DataFrame(GsCv.cv_results_).sort_values('rank_test_'+GsCv.refit).filter(regex='mean_|param_',axis=1)
{'n_estimators': 400, 'min_child_weight': 0.5, 'learning_rate': 0.1}
Out[22]:
mean_fit_time mean_score_time param_n_estimators param_min_child_weight param_learning_rate mean_test_roc_auc mean_test_f1 mean_test_recall mean_test_precision
1 132.477945 1.004839 400 0.5 0.1 0.624623 0.149275 0.510909 0.087410
3 198.232631 1.352835 600 6 0.1 0.616957 0.146844 0.495167 0.086207
2 132.361558 1.032892 400 6 0.3 0.603519 0.141209 0.455951 0.083544
0 4.086651 0.550472 300 1 5 0.557244 0.131300 0.331397 0.100085
In [23]:
bst = GsCv.best_estimator_.fit(X_train, y_train,
                            eval_set=eval_set,
                            early_stopping_rounds=early_stopping_rounds,
                            verbose=False)

3. Evaluation

In [26]:
results = evaluate_with_price(X_test, y_test, bst)
Score after transformation 
Recall: 0.9991
Precision: 0.05801
F1: 0.1097
There were 1,158.0 clicks on ads with predicted 19,945 clicks and a total profit of 50,632
In [27]:
plot_pred_hist()
In [28]:
results = evaluate_with_price(X_test, y_test, bst, 0.7)
Score after transformation 
Recall: 0.08549
Precision: 0.1557
F1: 0.1104
There were 1,158.0 clicks on ads with predicted 636 clicks and a total profit of 5,403

Checking the data set with a Random Forest

In [32]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, 
                            n_estimators=100, 
                            min_samples_leaf=2,
                            max_features=None,
                            class_weight='balanced',
                            verbose=False).fit(X_train, y_train)
In [33]:
results = evaluate_with_price(X_test, y_test, rf)
Score after transformation 
Recall: 0.8092
Precision: 0.06157
F1: 0.1144
There were 1,158.0 clicks on ads with predicted 15,218 clicks and a total profit of 41,939
In [34]:
plot_pred_hist()

When we compare the models, they produce similar results: the Random Forest reaches a slightly higher F1 (0.1144 vs. 0.1097), while XGBoost yields the higher profit at the default threshold (50,632 vs. 41,939).
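Since the models land in a similar range, a natural follow-up is to tune the decision threshold for profit directly rather than fixing it up front. A minimal sketch (the payoffs mirror evaluate_with_price; the function name is illustrative):

```python
import numpy as np

# Sweep candidate thresholds and keep the most profitable one.
# p: predicted click probabilities, y: true labels (0/1).
# Payoff: +60 for a served impression that clicks, -1 otherwise.
def best_threshold(p, y, grid=np.linspace(0.0, 1.0, 101)):
    profits = [60 * np.sum((p > t) & (y == 1)) - np.sum((p > t) & (y == 0))
               for t in grid]
    i = int(np.argmax(profits))
    return grid[i], profits[i]
```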